8 research outputs found
Automatic parallel implementations of adjoint codes for structured mesh applications
Algorithmic Differentiation (AD) has been shown to be an essential tool for obtaining sensitivity information in multiple areas of science, such as Computational Fluid Dynamics (CFD) applications or finance. Yet there is no adequate tool to reduce the cost of producing performance-portable AD codes, especially for modern hardware such as GPU clusters. This paper sketches our plans and progress so far towards extending the OPS framework with an adjoint tape (storage for descriptors of intermediate steps and intermediate states of variables), and shows preliminary performance results on CPU nodes. OPS (the Oxford Parallel library for Structured mesh solvers) has shown good performance and scaling on a wide range of HPC architectures. Our work aims to exploit the benefits of OPS to provide performance-portable adjoint implementations for future structured mesh stencil applications, requiring only minimal modifications to code that uses OPS.
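The adjoint-tape idea can be illustrated with a minimal reverse-mode sketch (the names and structure here are ours, not the OPS API): the forward sweep records each operation's output, inputs, and local partial derivatives on a tape, and the reverse sweep replays the tape backwards, accumulating adjoints.

```python
# Minimal reverse-mode AD tape (illustrative only; not the OPS design).
# Forward pass records (output id, [(input id, local partial), ...]);
# the reverse sweep walks the tape backwards, accumulating adjoints.

class Tape:
    def __init__(self):
        self.entries = []    # recorded operations, in execution order
        self.adjoints = {}

    def record(self, out_id, partials):
        self.entries.append((out_id, partials))

    def reverse(self, seed_id):
        self.adjoints = {seed_id: 1.0}
        for out_id, partials in reversed(self.entries):
            bar = self.adjoints.get(out_id, 0.0)
            for in_id, d in partials:
                self.adjoints[in_id] = self.adjoints.get(in_id, 0.0) + bar * d

tape = Tape()

def mul(a_id, a, b_id, b, out_id):
    tape.record(out_id, [(a_id, b), (b_id, a)])   # d(ab)/da = b, d(ab)/db = a
    return a * b

def add(a_id, a, b_id, b, out_id):
    tape.record(out_id, [(a_id, 1.0), (b_id, 1.0)])
    return a + b

# f(x, y) = x*y + x  =>  df/dx = y + 1, df/dy = x
x, y = 3.0, 4.0
t = mul("x", x, "y", y, "t")
f = add("t", t, "x", x, "f")
tape.reverse("f")
# tape.adjoints["x"] == 5.0, tape.adjoints["y"] == 3.0
```

In an OPS-like setting the tape entries would be loop descriptors and checkpointed dataset states rather than scalar operations, but the record-then-reverse structure is the same.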
Bitwise Reproducible task execution on unstructured mesh applications
Many mesh applications use floating-point arithmetic, which does not necessarily obey the associative laws of algebra. This can make an application non-reproducible between runs. In this paper we present a method for unstructured mesh applications that provides bitwise reproducibility between separate runs, even when they are started with different numbers of MPI processes. We implement our work in the OP2 domain-specific library, which provides an API that abstracts the solution of unstructured mesh computations. We carry out a performance analysis of our method applied to two applications: a simple Airfoil application, and a more complex Aero application that uses a finite element method and a conjugate gradient algorithm. We show a 1.49× to 2.37× slowdown on these applications as the price of full bitwise reproducibility.
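The core difficulty can be shown with a tiny sketch (ours, not the OP2 implementation): because floating-point addition is not associative, the summation order must be made canonical. One way is to tag every contribution with its global element index and always reduce in global-index order, so the result is bitwise identical for any partitioning.

```python
# Illustrative fixed-order reduction (not the OP2 implementation):
# contributions are tagged with global element indices and summed in a
# canonical global order, independent of how elements were distributed
# across MPI ranks.

def reproducible_sum(tagged_partials):
    # tagged_partials: list of (global_index, value) gathered from all ranks
    total = 0.0
    for _, v in sorted(tagged_partials):   # canonical global order
        total += v
    return total

vals = [0.1 * i for i in range(100)]

# Two different "MPI decompositions" of the same 100 elements:
two_ranks  = ([(i, vals[i]) for i in range(50)]
              + [(i, vals[i]) for i in range(50, 100)])
four_ranks = [(i, vals[i]) for r in range(4) for i in range(r, 100, 4)]

s2 = reproducible_sum(two_ranks)
s4 = reproducible_sum(four_ranks)
# s2 == s4, bit for bit; a naive sum in arrival order may differ.
```

A real implementation also has to make the per-element assembly order canonical (e.g. contributions from neighbouring cells), not just global reductions, but the principle is the same.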
Loop Tiling in Large-Scale Stencil Codes at Run-time with OPS
The key common bottleneck in most stencil codes is data movement, and prior
research has shown that improving data locality through optimisations that
schedule across loops perform particularly well. However, in many large PDE
applications it is not possible to apply such optimisations through compilers
because there are many options, execution paths and data per grid point, many
dependent on run-time parameters, and the code is distributed across different
compilation units. In this paper, we adapt the data locality improving
optimisation called iteration space slicing for use in large OPS applications
both in shared-memory and distributed-memory systems, relying on run-time
analysis and delayed execution. We evaluate our approach on a number of
applications, observing speedups of 2× on the CloverLeaf 2D/3D proxy
applications, which contain 83/141 loops respectively, on the linear
solver TeaLeaf, and on the compressible Navier-Stokes solver
OpenSBLI. We demonstrate strong and weak scalability up to 4608 cores of
CINECA's Marconi supercomputer. We also evaluate our algorithms on Intel's
Knights Landing, demonstrating maintained throughput as the problem size grows
beyond 16GB, and we do scaling studies up to 8704 cores. The approach is
generally applicable to any stencil DSL that provides per loop data access
information.
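The slicing-plus-delayed-execution idea can be sketched in miniature (names are ours, not the OPS runtime API): loops are queued together with their stencil halos, and at a synchronisation point each tile of the final loop is traced backwards, so earlier loops execute only over the slice that the tile depends on.

```python
# Illustrative iteration-space slicing with delayed execution (not the OPS
# runtime). Queued loops record their stencil halos; each tile of the last
# loop is traced backwards, widening the slice by each consumer's halo, so
# all loops in the chain run tile-by-tile while data stays in cache.

N, TILE = 32, 8

a = [float(i * i % 7) for i in range(N)]
b = [0.0] * N
c = [0.0] * N

def loop_A(i):   # b = 3-point sum of a (indices clamped at the boundary)
    b[i] = a[max(0, i - 1)] + a[i] + a[min(N - 1, i + 1)]

def loop_B(i):   # c = 3-point sum of b
    c[i] = b[max(0, i - 1)] + b[i] + b[min(N - 1, i + 1)]

def run_tiled(loop_queue, n, tile):
    # loop_queue: ordered (kernel, halo) pairs recorded by delayed execution
    for start in range(0, n, tile):
        lo, hi = start, min(start + tile, n)
        # Walk the queue backwards, widening the needed slice by each halo.
        slices = []
        for kernel, halo in reversed(loop_queue):
            slices.append((kernel, lo, hi))
            lo, hi = max(0, lo - halo), min(n, hi + halo)
        # Execute the loops in their original order, over their slices only.
        for kernel, s_lo, s_hi in reversed(slices):
            for i in range(s_lo, s_hi):
                kernel(i)

run_tiled([(loop_A, 1), (loop_B, 1)], N, TILE)
```

This sketch trades a little redundant computation on tile overlaps for locality; the per-loop access information that the last sentence of the abstract refers to is exactly the `halo` recorded with each queued loop.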
An abstraction for local computations on structured meshes and its extension to handling multiple materials
Computations involving a neighbourhood on structured meshes represent a wide class of applications that includes the simulation of cellular automata and the solution of partial differential equations (PDEs). In this paper we present an abstraction for describing such computations at a high level, allowing fast experimentation and high productivity. The abstraction is designed so that it can be automatically converted to various high-performance implementations. A critical feature of this abstraction is an extension to support a varying number of materials, or species, at each grid point, enabling much more complex simulations.
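A minimal sketch of such an abstraction (the names are ours, not the actual library API): the user writes only a per-point kernel over its neighbourhood, and the framework owns the loop, so it can later emit tiled, MPI, or GPU implementations from the same description. The multi-material extension appears as a ragged per-point list of species.

```python
# Illustrative neighbourhood abstraction (names are ours, not the library
# API). The framework owns the mesh loop; the user kernel sees only a
# local neighbourhood, which keeps it retargetable to other backends.

N = 8
u = [0.0] * N
u[N // 2] = 4.0
u_new = [0.0] * N

def par_loop(kernel, lo, hi, *args):
    for i in range(lo, hi):        # framework-owned loop over the mesh
        kernel(i, *args)

def diffuse(i, u, u_new):          # user kernel: a 3-point neighbourhood
    u_new[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1]

par_loop(diffuse, 1, N - 1, u, u_new)

# Multi-material extension: a ragged per-point list of (material, volume
# fraction) pairs, so the number of species varies from point to point.
materials = [[(0, 1.0)], [(0, 0.5), (1, 0.5)], [(1, 1.0)]]
rho_of = {0: 2.0, 1: 8.0}          # per-material densities (made-up values)
rho = [0.0] * len(materials)

def mix_density(i, mats, rho_out):
    rho_out[i] = sum(frac * rho_of[m] for m, frac in mats[i])

par_loop(mix_density, 0, len(materials), materials, rho)
```

Because the kernel never indexes the ragged material list directly outside the point it is given, the framework remains free to choose a cell-centric or material-centric storage layout underneath.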
Beyond 16GB: Out-of-Core Stencil Computations
Stencil computations are a key class of applications, widely used in the
scientific computing community, and a class that has particularly benefited
from performance improvements on architectures with high memory bandwidth.
Unfortunately, such architectures come with a limited amount of fast memory,
which limits the size of the problems that can be efficiently solved. In
this paper, we address this challenge by applying the well-known cache-blocking
tiling technique to large scale stencil codes implemented using the OPS domain
specific language, such as CloverLeaf 2D, CloverLeaf 3D, and OpenSBLI. We
introduce a number of techniques and optimisations to help manage data resident
in fast memory, and minimise data movement. Evaluating our work on Intel's
Knights Landing Platform as well as NVIDIA P100 GPUs, we demonstrate that it is
possible to solve problems 3 times larger than the on-chip memory size with at
most a 15% loss in efficiency.
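The staging pattern behind such out-of-core execution can be sketched in one dimension (ours, not the OPS implementation): the grid lives in large, slow memory; each tile plus a one-point halo is copied into a small fast buffer, the stencil runs entirely in the buffer, and the tile interior is written back, so the fast-memory footprint is proportional to the tile, not the grid.

```python
# Illustrative out-of-core staging sketch (not the OPS implementation).
# The full grid stays in slow memory; only tile + halo is ever resident
# in the "fast" buffer, mimicking MCDRAM on KNL or HBM on a GPU.

def jacobi_out_of_core(u, tile):
    n = len(u)
    out = u[:]                            # result stays in slow memory
    for start in range(1, n - 1, tile):
        end = min(start + tile, n - 1)
        fast = u[start - 1:end + 1]       # stage tile + halo into fast memory
        for i in range(1, len(fast) - 1): # compute entirely in the buffer
            out[start + i - 1] = (fast[i - 1] + fast[i] + fast[i + 1]) / 3.0
    return out

grid = [float((3 * i) % 5) for i in range(20)]
tiled = jacobi_out_of_core(grid, 4)
```

The efficiency loss reported in the abstract comes from exactly the transfers this sketch makes explicit: the tile staging in and out of fast memory, which the paper's techniques aim to overlap and minimise.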